Process mining is a hot topic that has attracted intense interest in recent years and has a broad range of applications across industries.
Process mining is a process analysis method that aims to discover, monitor and improve real processes by extracting knowledge from the event logs readily available in an organization's information systems.
http://www.bupar.net/index.html
bupaR is an open-source, integrated suite of R-packages for the handling and analysis of business process data. It currently consists of 8 packages, including the central package, supporting different stages of a process mining workflow.
Overview of the eventlog data
## Log of 262200 events consisting of:
## 13087 cases
## 262200 instances of 24 activities
## 69 resources
## Events occurred from 2011-09-30 22:38:44 until 2012-03-14 15:04:54
##
## Variables were mapped as follows:
## Case identifier: CASE_concept_name
## Activity identifier: activity_id
## Resource identifier: resource_id
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle_id
##
## # A tibble: 262,200 x 9
## CASE_concept_na… CASE_AMOUNT_REQ CASE_REG_DATE activity_id lifecycle_id
## <chr> <chr> <chr> <fct> <fct>
## 1 173688 20000 2011-10-01T0… A_SUBMITTED COMPLETE
## 2 173688 20000 2011-10-01T0… A_PARTLYSU… COMPLETE
## 3 173688 20000 2011-10-01T0… A_PREACCEP… COMPLETE
## 4 173688 20000 2011-10-01T0… W_Complete… SCHEDULE
## 5 173688 20000 2011-10-01T0… W_Complete… START
## 6 173688 20000 2011-10-01T0… A_ACCEPTED COMPLETE
## 7 173688 20000 2011-10-01T0… O_SELECTED COMPLETE
## 8 173688 20000 2011-10-01T0… A_FINALIZED COMPLETE
## 9 173688 20000 2011-10-01T0… O_CREATED COMPLETE
## 10 173688 20000 2011-10-01T0… O_SENT COMPLETE
## # … with 262,190 more rows, and 4 more variables: resource_id <fct>,
## # timestamp <dttm>, activity_instance_id <chr>, .order <int>
Extract objects from Event Log
cases <- events %>% cases() # 13087 cases
activities <- events %>% activities() # 24 unique activities
resources <- events %>% resources() # 69 unique resources
traces <- events %>% traces() # 4366 unique traces
n_cs <- nrow(cases)
cases$final_status <- ""
for (x in 1:n_cs) {
if(regexpr("A_DECLINED",cases$trace[x]) > 0)
cases$final_status[x] <- "DECLINED"
else if(regexpr("A_CANCELLED",cases$trace[x]) > 0)
cases$final_status[x] <- "CANCELLED"
else if(regexpr("A_APPROVED",cases$trace[x]) > 0 | regexpr("A_REGISTERED",cases$trace[x]) > 0 | regexpr("A_ACTIVATED",cases$trace[x]) > 0)
cases$final_status[x] <- "SUCCEED"
else
cases$final_status[x] <- "OTHER"
}
cat("There are ", nrow(cases), " cases; ", nrow(activities), " unique activities; ", nrow(resources), " unique resources; ", nrow(traces), " unique traces;")
## There are 13087 cases; 24 unique activities; 69 unique resources; 4366 unique traces;
| CASE_concept_name | trace_length | number_of_activities | start_timestamp | complete_timestamp | trace | trace_id | duration_in_days | first_activity | last_activity | final_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 173697 | 3 | 3 | 2011-10-01 06:11:08 | 2011-10-01 06:11:46 | A_SUBMITTED,A_PARTLYSUBMITTED,A_DECLINED | 1 | 0.0004347 | A_SUBMITTED | A_DECLINED | DECLINED |
| 173700 | 3 | 3 | 2011-10-01 06:15:39 | 2011-10-01 06:16:21 | A_SUBMITTED,A_PARTLYSUBMITTED,A_DECLINED | 1 | 0.0004762 | A_SUBMITTED | A_DECLINED | DECLINED |
| 173703 | 9 | 5 | 2011-10-01 07:45:25 | 2011-10-01 11:02:12 | A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,W_Completeren aanvraag,A_CANCELLED,W_Completeren aanvraag | 1897 | 0.1366522 | A_SUBMITTED | W_Completeren aanvraag | CANCELLED |
| 173727 | 3 | 3 | 2011-10-01 10:08:46 | 2011-10-01 10:09:30 | A_SUBMITTED,A_PARTLYSUBMITTED,A_DECLINED | 1 | 0.0005058 | A_SUBMITTED | A_DECLINED | DECLINED |
| 173733 | 6 | 4 | 2011-10-01 10:39:34 | 2011-10-01 12:54:56 | A_SUBMITTED,A_PARTLYSUBMITTED,W_Afhandelen leads,W_Afhandelen leads,A_DECLINED,W_Afhandelen leads | 2927 | 0.0940082 | A_SUBMITTED | W_Afhandelen leads | DECLINED |
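The loop that derives final_status can also be written without iterating over rows. The following is a minimal sketch in base R, not the code used for the results above. `toy_cases` and `classify_final_status` are hypothetical names, and the last two traces are made up for illustration. The rules are applied in reverse priority so that DECLINED still wins over CANCELLED, and CANCELLED over SUCCEED, as in the loop.

```r
# Toy stand-in for the `cases` table; only the `trace` column matters here.
# The first two traces follow the table above, the last two are invented.
toy_cases <- data.frame(
  trace = c(
    "A_SUBMITTED,A_PARTLYSUBMITTED,A_DECLINED",
    "A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,W_Completeren aanvraag,A_CANCELLED",
    "A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED,A_ACCEPTED,A_FINALIZED,A_APPROVED",
    "A_SUBMITTED,A_PARTLYSUBMITTED,A_PREACCEPTED"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised version of the if / else-if chain.
classify_final_status <- function(trace) {
  status <- rep("OTHER", length(trace))
  # Lowest priority first; later assignments overwrite earlier ones.
  status[grepl("A_APPROVED|A_REGISTERED|A_ACTIVATED", trace)] <- "SUCCEED"
  status[grepl("A_CANCELLED", trace, fixed = TRUE)] <- "CANCELLED"
  status[grepl("A_DECLINED", trace, fixed = TRUE)] <- "DECLINED"
  status
}

toy_cases$final_status <- classify_final_status(toy_cases$trace)
print(toy_cases$final_status)
```

On the real data the same helper could be applied as `cases$final_status <- classify_final_status(cases$trace)`, assuming `trace` is a character column as shown in the table.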
p1 <-
cases %>% ggplot(mapping = aes(x = trace_length)) + geom_density() + geom_vline(aes(xintercept = mean(trace_length)),colour = anz_color1) +
labs(x = "Length of Trace", y = "",title = "Distribution of Trace Length") + theme_minimal()
p2 <-
cases %>% ggplot(mapping = aes(x = number_of_activities)) + geom_density() + geom_vline(aes(xintercept = mean(number_of_activities)),colour = anz_color1) +
labs(x = "Number of Activities", y = "",title = "Distribution of Activities Number") + theme_minimal()
p3 <-
cases %>% filter(final_status != "OTHER") %>% ggplot(aes(x = factor(final_status), y = duration_in_days)) + geom_boxplot() +
labs(x = "Final Status", y = "",title = "Boxplot of Duration in Days") + theme_minimal()
p4 <- events %>%
trace_length("log") %>%
plot
multiplot(p1, p4, p2,p3, cols = 2)
## The application process always starts with A_SUBMITTED and can end with 11 different activities, including A_DECLINED, A_CANCELLED and A_REGISTERED.
activities <- activities %>%
mutate(act_type = ifelse(substr(activities$activity_id,1,2) == "W_",
"workItem",
ifelse(substr(activities$activity_id,1,2) == "A_",
"application",
ifelse(substr(activities$activity_id,1,2) == "O_",
"offer",
"other"))))
p1 <-
activities %>%
ggplot(mapping = aes(x = activity_id, y = absolute_frequency, fill = act_type)) +
geom_bar(position = "dodge", stat = "identity") +
labs(x = "Activities", y = "Absolute Frequency", title = "View at activity level") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label = absolute_frequency),vjust = 1.6, colour = "white", size = 2) +
geom_vline(xintercept = c(3,4,5),colour = anz_color1, size = .5)
p2 <-
activities %>%
filter(activity_id %in% c("A_APPROVED","A_CANCELLED","A_DECLINED")) %>%
ggplot(mapping = aes(x = activity_id, y = absolute_frequency)) +
geom_bar(position = "stack", stat = "identity", fill = anz_color1) +
labs(x = "End states of applications", y = "Absolute Frequency", title = "Distribution of application final result") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label = absolute_frequency),vjust = 1.6, colour = "white", size = 4)
# multiplot(p1, p2,cols = 1)
p1
Application Events (A_)
Refers to states of the application itself.
* A_SUBMITTED / A_PARTLYSUBMITTED - Initial application submission
* A_PREACCEPTED - Application pre-accepted but requires additional information
* A_ACCEPTED - Application accepted and pending screen for completeness
* A_FINALIZED - Application finalized after passing screen for completeness
* A_APPROVED / A_REGISTERED / A_ACTIVATED - End state of successful (approved) applications
* A_CANCELLED / A_DECLINED - End states of unsuccessful applications
Offer Events (O_)
Refers to states of an offer communicated to the customer.
* O_SELECTED - Applicant selected to receive offer
* O_PREPARED / O_SENT - Offer prepared and transmitted to applicant
* O_SENT BACK - Offer response received from applicant
* O_ACCEPTED - End state of successful offer
* O_CANCELLED / O_DECLINED - End states of unsuccessful offers
Work item Events (W_)
Refers to states of work items that occur during the approval process. These events capture most of the manual effort exerted by the bank's resources during the application approval process. The events describe efforts during various stages of the application process.
* W_Afhandelen leads - Following up on incomplete initial submissions
* W_Completeren aanvraag - Completing pre-accepted applications
* W_Nabellen offertes - Follow up after transmitting offers to qualified applicants
* W_Valideren aanvraag - Assessing the application
* W_Nabellen incomplete dossiers - Seeking additional information during assessment phase
* W_Beoordelen fraude - Investigating suspect fraud cases
* W_Wijzigen contractgegevens - Modifying approved contracts
Names and Descriptions of Transitions in the Work Item Life Cycle
* SCHEDULE - Indicates a work item has been scheduled to occur in the future
* START - Indicates the opening / commencement of a work item
* COMPLETE - Indicates the closing / conclusion of a work item
Each resource_id represents an employee involved in the application process. Resource 53 is extremely busy; it looks like an automated system (a robot) rather than a person.
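The transitions above can be used to measure how long a work item was actually being worked on. Here is a small sketch in base R on invented data. It assumes that the START and COMPLETE events of one work item share the same activity_instance_id, which I have not verified for this log, and `toy_events` is a hypothetical stand-in for the real event log.

```r
# Invented events: two work items, the first one also has a SCHEDULE event.
toy_events <- data.frame(
  activity_instance_id = c("1", "1", "1", "2", "2"),
  lifecycle_id = c("SCHEDULE", "START", "COMPLETE", "START", "COMPLETE"),
  timestamp = as.POSIXct(
    c("2011-10-01 10:00:00", "2011-10-01 10:05:00", "2011-10-01 10:20:00",
      "2011-10-01 11:00:00", "2011-10-01 11:02:30"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Split the START and COMPLETE events and join them per work item.
keep <- c("activity_instance_id", "timestamp")
starts <- toy_events[toy_events$lifecycle_id == "START", keep]
completes <- toy_events[toy_events$lifecycle_id == "COMPLETE", keep]
durations <- merge(starts, completes, by = "activity_instance_id",
                   suffixes = c("_start", "_complete"))

# Processing time in minutes between START and COMPLETE.
durations$minutes <- as.numeric(
  difftime(durations$timestamp_complete, durations$timestamp_start, units = "mins")
)
print(durations)
```

The gap between SCHEDULE and START could be computed the same way to estimate waiting time.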
p1 <-
resources %>% filter(absolute_frequency > 7000) %>%
ggplot(mapping = aes(x = resource_id, y = absolute_frequency)) +
geom_bar(position = "dodge", stat = "identity", fill = anz_color1) +
labs(x = "Resources", y = "Absolute Frequency", title = "View at resource level") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label = absolute_frequency),vjust = 1.6, colour = "white", size = 3)
p2 <-
events %>% filter(resource_id == 53, activity_id %in% c("A_SUBMITTED","A_PARTLYSUBMITTED","W_Afhandelen leads","W_Completeren aanvraag","A_DECLINED","A_CANCELLED")) %>%
ggplot(mapping = aes(x = activity_id)) + geom_bar(fill = anz_color1) +
labs(x = "Activities", y = "Absolute Frequency", title = "Activity Distribution of Resource 53") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
multiplot(p1, p2,cols = 2)
## There are 4366 unique traces. Let's look at the most frequent ones (coverage = 50%).
## 26.2% of applications are declined directly; 14.3% are declined because the applicant does not (or is not able to) complete the online application.
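To make the coverage idea concrete, here is a small sketch in base R on invented traces. It assumes that coverage means the cumulative share of cases explained by the most frequent traces. `toy_traces` is hypothetical data, not taken from the event log.

```r
# Ten invented cases, each represented by its trace.
toy_traces <- c(rep("A,B,D", 5), rep("A,B,C,D", 3), "A,C,D", "A,B,B,D")

# Count cases per trace and sort from most to least frequent.
freq <- sort(table(toy_traces), decreasing = TRUE)

# Cumulative share of cases covered by the top-k traces.
coverage <- cumsum(as.numeric(freq)) / length(toy_traces)

# Smallest number of traces needed to cover at least 50% of the cases.
n_needed <- which(coverage >= 0.5)[1]
print(coverage)
print(n_needed)
```

In this toy log a single trace already covers half of the cases; in the real log the most frequent traces have to be added one by one until the 50% threshold is reached.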
Let’s only look at the process at application level:
events %>% filter_activity(c("A_SUBMITTED","A_PARTLYSUBMITTED","A_DECLINED","A_PREACCEPTED","A_ACCEPTED","A_FINALIZED","A_CANCELLED","A_ACTIVATED","A_APPROVED","A_REGISTERED")) %>%
process_map()
final_status2 <- c("A_CANCELLED","A_Succeed","A_DECLINED")
final_count <- c(2807,2246,7635)
wf_data2 <- data.frame(final_status2,final_count)
wf_data2 %>%
ggplot(aes(x = final_status2, y = final_count)) + geom_bar(fill = anz_color1, position = "dodge", stat = "identity") +
labs(x = "Final Status", y = "Count", title = "Final status of cases") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can dive into the process step by step (application-level activities) to get more detailed information. For example, the following diagram shows the process between A_PARTLYSUBMITTED and A_PREACCEPTED:
se_process_1 <- events %>% filter_trim(start_activities = "A_PARTLYSUBMITTED",end_activities = "A_PREACCEPTED") %>% process_map()
se_process_2 <- events %>% filter_trim(start_activities = "A_PREACCEPTED",end_activities = "A_ACCEPTED") %>% process_map()
se_process_3 <- events %>% filter_trim(start_activities = "A_ACCEPTED",end_activities = "A_FINALIZED") %>% process_map()
se_process_4 <- events %>% filter_trim(start_activities = "A_FINALIZED",end_activities = c("A_ACTIVATED")) %>% process_map()
se_process_5 <- events %>% filter_trim(start_activities = "A_FINALIZED",end_activities = c("A_APPROVED")) %>% process_map()
se_process_6 <- events %>% filter_trim(start_activities = "A_FINALIZED",end_activities = c("A_REGISTERED")) %>% process_map()
se_process_1
Based on the findings above, we can simplify the process map by collapsing the redundant activities:
events %>% filter_activity(c("A_SUBMITTED","A_PARTLYSUBMITTED","A_DECLINED","A_PREACCEPTED","A_ACCEPTED","A_FINALIZED","A_CANCELLED","A_ACTIVATED","A_APPROVED","A_REGISTERED")) %>%
act_collapse(A_Success = c("A_ACTIVATED","A_APPROVED","A_REGISTERED"),
A_SUBMITTED = c("A_SUBMITTED","A_PARTLYSUBMITTED")) %>%
process_map()
The following diagram shows how often one activity directly follows another:
events %>% filter_activity(c("A_SUBMITTED","A_PARTLYSUBMITTED","A_DECLINED","A_PREACCEPTED","A_ACCEPTED","A_FINALIZED","A_CANCELLED","A_ACTIVATED","A_APPROVED","A_REGISTERED")) %>%
precedence_matrix(type = "absolute") %>% plot()
Accordingly, the following diagram gives a simplified view of the standard process flow at application level and the number of cases at each phase:
Now I will dive into the declined cases to discover the reasons for the decline. From the chart above, we can see that 5719 of the 7635 declined cases are declined immediately after online submission.
Let’s check the other declined cases:
events_declined <-
events %>%
filter_activity(c("A_PARTLYSUBMITTED","A_DECLINED","A_PREACCEPTED","A_ACCEPTED","A_FINALIZED","W_Afhandelen leads","W_Completeren aanvraag",
"W_Nabellen offertes","W_Valideren aanvraag","W_Nabellen incomplete dossiers","W_Beoordelen fraude","W_Wijzigen contractgegevens")) %>%
filter_trim(start_activities = "A_PARTLYSUBMITTED",end_activities = c("A_DECLINED"))
events_declined %>% process_map(type = frequency())
Illustrated as:
Overall, the following chart shows the final status of the applications: only 2246 out of 13087 succeeded.
x <- list("Total","CANCELLED","PREACCEPTED","Direct_Decline","Incomplete","Suspect_fraud","Completeren","Nabellen","Assessing","Qualified","ACCEPTED","Succeed")
measure <- c("Total","relative","relative","relative","relative","relative","relative","relative","relative","relative","relative","relative")
text <- c("13087","-2807","-399","-3429","-2234","-57","-1088","-86","-668","-48","-25","2246")
y <- c(13087,-2807,-399,-3429,-2234,-57,-1088,-86,-668,-48,-25,-2246)
wf_data <- data.frame(x=factor(x,levels = x),measure,text,y)
p <- plot_ly(
wf_data, name = "20", type = "waterfall", measure = ~measure,
x = ~x, textposition = "outside", y= ~y, text =~text,
connector = list(line = list(color= "rgb(63, 63, 63)"))) %>%
layout(title = "Applications final status",
xaxis = list(title = ""),
yaxis = list(title = ""),
autosize = TRUE,
showlegend = TRUE)
p
Through comprehensive analysis of the event log, we converted a data set containing 262,200 events and 13,087 cases into a clearly interpretable, end-to-end workflow for a loan and overdraft approval process. I suggest the following improvements:
* 1. Simplify the process by removing redundant / mixed activities, e.g. A_SUBMITTED / A_PARTLYSUBMITTED; A_APPROVED / A_REGISTERED / A_ACTIVATED.
* 2. There are 4366 traces in total. According to the trace explorer, the 12 most frequent trace types give 50% coverage, and the longest of them contains only 14 activities, so there is room to optimize the process.
* 3. Refine the automated assessment that follows online submission: 58% of all cases are declined, and 44% are declined directly.
* 4. The duration_in_days of declined applications is 2.0 days, whereas for successful and cancelled applications it is 16.7 and 18.5 days respectively.
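The percentages in point 3 can be re-derived from the case counts quoted earlier in this article. This is just an arithmetic check in base R; the variable names are my own.

```r
# Case counts quoted in the text above.
total_cases <- 13087
declined <- 7635
declined_immediately <- 5719
succeeded <- 2246

# Shares of all cases, rounded to whole percentages.
pct_declined <- round(100 * declined / total_cases)
pct_declined_immediately <- round(100 * declined_immediately / total_cases)
pct_succeeded <- round(100 * succeeded / total_cases)

cat(pct_declined, pct_declined_immediately, pct_succeeded, "\n")
```

The last figure shows that roughly 17% of all applications end successfully.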
Heuristics Miner is an algorithm that acts on the Directly-Follows Graph, providing a way to handle noise and to find common constructs (e.g. a dependency between two activities, AND). The output of the Heuristics Miner is a Heuristics Net: an object that contains the activities and the relationships between them.
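To illustrate what a Directly-Follows Graph contains, here is a minimal sketch in base R that counts how often one activity is directly followed by another. `toy_log` is an invented list of traces, and this is only an illustration of the idea, not how bupaR or any Heuristics Miner implementation computes it internally.

```r
# Three invented traces over the activities A, B and C.
toy_log <- list(c("A", "B", "C"), c("A", "B", "C"), c("A", "C", "B"))

# For each trace, pair every activity with its direct successor.
pairs <- do.call(rbind, lapply(toy_log, function(tr) {
  data.frame(from = head(tr, -1), to = tail(tr, -1), stringsAsFactors = FALSE)
}))

# Cross-tabulate the pairs: rows are predecessors, columns are successors.
dfg <- table(from = pairs$from, to = pairs$to)
print(dfg)
```

Each non-zero cell corresponds to an edge of the graph, and the cell value is the edge frequency.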
A frequency-based metric is used to indicate how certain we are that there is truly a dependency relation between two events/activities A and B (notation A ⇒W B). Let W be an event log over T, and a, b ∈ T. Then |a >W b| is the number of times a is directly followed by b in W, and:
$$a \Rightarrow_W b = \frac{|a >_W b| - |b >_W a|}{|a >_W b| + |b >_W a| + 1}$$
An example to illustrate the formula
Suppose that, in 5 traces, activity A is directly followed by activity B but the other way around never occurs. Then A ⇒W B = 5/6 = 0.833, indicating that we are not completely sure of the dependency relation (only 5 observations, possibly caused by noise).
However if there are 50 traces in which A is directly followed by B but the other way around never occurs, the value of A ⇒W B = 50/51 = 0.980 indicates that we are pretty sure of the dependency relation.
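The two values above can be reproduced with a few lines of base R. This sketch assumes the standard Heuristics Miner dependency measure, (|a >W b| - |b >W a|) / (|a >W b| + |b >W a| + 1); the function name `dependency` is my own.

```r
# n_ab: number of times A is directly followed by B
# n_ba: number of times B is directly followed by A
dependency <- function(n_ab, n_ba) {
  (n_ab - n_ba) / (n_ab + n_ba + 1)
}

dep_5 <- round(dependency(5, 0), 3)    # 5 observations, never reversed
dep_50 <- round(dependency(50, 0), 3)  # 50 observations, never reversed
print(c(dep_5, dep_50))
```

The "+ 1" in the denominator is what keeps the value below 1 and makes it grow with the number of observations, so a relation seen only a few times gets a lower score than one seen often.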
Dependency graph of application process
Causal graph / Heuristics net